etcdserver: separate "raft log compact" from snapshot #18235
Conversation
Signed-off-by: Clement <[email protected]>
Hi @clement2026. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/ok-to-test
Signed-off-by: Clement <[email protected]>
/retest
server/etcdserver/server.go
```diff
-	compacti := uint64(1)
+	compacti := uint64(0)
 	if snapi > s.Cfg.SnapshotCatchUpEntries {
 		compacti = snapi - s.Cfg.SnapshotCatchUpEntries
 	}
```
Why change from 1 to 0?
https://github.com/etcd-io/raft/blob/ee19bbe7f9bb16fd7d874a6c08dfac319596d2ad/storage.go#L258-L260
```go
if compactIndex > ms.lastIndex() {
	getLogger().Panicf("compact %d is out of bound lastindex(%d)", compactIndex, ms.lastIndex())
}
```
A panic occurred at this line during local test runs. It appears the initial value of ms.lastIndex() is 0, so compacti should default to zero.

On the main branch, the panic doesn't occur because ms always contains more than one element when compaction starts, which isn't the case in this patch.
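For context, here is a minimal runnable sketch of the behaviour described above, assuming the public go.etcd.io/raft/v3 MemoryStorage API (this is not etcd server code): on a freshly created storage lastIndex() is 0, so compacting at index 1 trips the panic, while index 0 merely reports that the index is already compacted.

```go
package main

import (
	"fmt"

	"go.etcd.io/raft/v3"
)

func main() {
	ms := raft.NewMemoryStorage() // fresh storage: lastIndex() == 0

	// Compacting at index 0 is harmless: raft just reports ErrCompacted.
	fmt.Println(ms.Compact(0))

	// Compacting at index 1 would exceed lastIndex() and hit the Panicf
	// shown in the snippet above, so it stays commented out here.
	// ms.Compact(1)
}
```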
Great work, the change looks good, but as with any change to the etcd apply loop, we need to be careful here. It would be good to investigate; would you be able to collect CPU & memory profiles?
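As an aside for anyone reproducing this: etcd started with --enable-pprof serves the standard Go /debug/pprof endpoints on its client URL. A minimal sketch of the same mechanism in a plain Go program (hypothetical port, not part of this PR) looks like this.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Any Go process wired like this serves CPU and heap profiles over HTTP,
	// which is essentially what etcd does when --enable-pprof is set.
	log.Fatal(http.ListenAndServe("127.0.0.1:6060", nil))
}
```

Profiles can then be pulled with go tool pprof, e.g. go tool pprof http://127.0.0.1:6060/debug/pprof/heap for memory, or .../debug/pprof/profile?seconds=30 for CPU.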
Found by robustness tests:
Thanks! I understand the importance of the apply loop. I'll investigate it and provide CPU & memory profiles after resolving the bug and test failures.
Signed-off-by: Clement <[email protected]>
/retest
Thanks for fixing the test. Please note that although this PR is small, it might have a big impact on etcd behavior, so its review might take a while. It would be great to repeat the tests and compare the profiles to understand the performance difference. cc @ahrtr
Thanks for checking in. I temporarily fixed the test to try out my ideas, but it might cause a serious memory leak. Currently, I'm developing an optimized version that properly fixes the test. I’ll update you once it's done and provide the CPU and memory profiles.
…all ongoing snapshots Signed-off-by: Clement <[email protected]>
abcae5f to 93d0484
/retest
Signed-off-by: Clement <[email protected]>
adb970c to 302254b
/retest
Signed-off-by: Clement <[email protected]>
I have completed some benchmarks. Here’s the heatmap comparison between this PR and the main branch:

Since the full benchmark is time-consuming, the results were generated using this simplified script:

```sh
export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh
```

The script was executed on a Vultr cloud VM with 8 vCPUs, 32 GB of memory, and a 150 GB NVMe drive. All subsequent results are based on this machine configuration unless otherwise specified.

Updates:
Benchmark etcd using a small CONN_CLI_COUNT, a small VALUE_SIZE, and a big RUN_COUNT (1,000,000). The benchmark took about 600 s.

Images produced by executing
Compared to the main branch, this patch uses more memory and consumes less CPU. Below are the steps to reproduce.

Step 1: Prepare ENV

```sh
ETCD_BIN="./bin/etcd"
ETCD_BM_BIN="./bin/tools/benchmark"
BACKEND_SIZE="$((20 * 1024 * 1024 * 1024))" # 20GB
KEY_SPACE_SIZE="$((1024 * 64))"
RATIO="4.0"
KEY_SIZE="256"
RUN_COUNT="1000000"
RANGE_RESULT_LIMIT="100"
VALUE_SIZE="$((2 ** 10))" # VALUE_SIZE="$((2 ** 14))"
CONN_CLI_COUNT="$((2**5))"
```

Step 2: start etcd with pprof enabled

```sh
${ETCD_BIN} --quota-backend-bytes=${BACKEND_SIZE} \
  --log-level 'error' \
  --enable-pprof
```

Step 3: init etcd with data

```sh
${ETCD_BM_BIN} put --sequential-keys \
  --key-space-size=${KEY_SPACE_SIZE} \
  --val-size=${VALUE_SIZE} --key-size=${KEY_SIZE} \
  --total=${KEY_SPACE_SIZE} \
  &>/dev/null
```

Step 4: capture heap usage

Step 6: benchmark

```sh
${ETCD_BM_BIN} txn-mixed "" \
  --conns=${CONN_CLI_COUNT} --clients=${CONN_CLI_COUNT} \
  --total=${RUN_COUNT} \
  --rw-ratio ${RATIO} --limit ${RANGE_RESULT_LIMIT} \
  --val-size ${VALUE_SIZE}
```

Step 7: capture heap usage
Benchmark etcd using a big CONN_CLI_COUNT, a big VALUE_SIZE, and a big RUN_COUNT (1,000,000). The benchmark took about 1100 s.

Images produced by executing

Compared to the main branch, this patch uses less memory and consumes less CPU. The steps to reproduce are nearly identical to the previous benchmark description, except for the different ENV settings.

Step 1: Prepare ENV

```sh
ETCD_BIN="./bin/etcd"
ETCD_BM_BIN="./bin/tools/benchmark"
BACKEND_SIZE="$((20 * 1024 * 1024 * 1024))" # 20GB
KEY_SPACE_SIZE="$((1024 * 64))"
RATIO="4.0"
KEY_SIZE="256"
RUN_COUNT="1000000"
RANGE_RESULT_LIMIT="100"
VALUE_SIZE="$((2 ** 14))"
CONN_CLI_COUNT="$((2**11))"
```
I am still struggling to understand why this patch reduced the
@serathius
@clement2026 great work; it's a lot of information and it should be enough to draw some conclusions. I think what we are missing is a clear summarization of your findings: what tests were done, and what improvement/regression each test showed. Please see kubernetes/enhancements#2340 (comment) for an example.
Thanks for the example. I’ll create a summary of the findings, including the tests conducted and their results.
@serathius Here’s a summary of the three tests I’ve done. Please review it when you have a moment.

Test 1: #18235 (comment)
Improvements

Test 2: #18235 (comment)
Improvements

Test 3 (1 KB): #18235 (comment)
Improvements
Degradations

Updates:
I find testing and recording the memory usage of etcd to be quite time-consuming. We already have rw-benchmark.sh for collecting read/write performance data. Are there similar scripts for gathering memory usage statistics?
I feel your pain; it shouldn't take so much time to iterate to validate a performance improvement. There is an open issue to improve the current situation, but there hasn't been a lot of progress: #16467

The change looks overall good to me. Usage of a heap might be overkill here; I don't expect too many concurrent snapshots to be executing, however users might set the snapshot count very low, which for a big db file could cause many concurrent snapshot attempts.

The improvement in performance is awesome! I don't think the increase in memory for small objects is surprising or concerning.

Executing compaction after each apply might be overkill. Would you want to try checking whether reducing the frequency to once every 100/1000 would be better?

Happy to see the change passes the robustness test. We might also want to think about whether we could add some more testing to ensure this change is correct.

cc @ahrtr to confirm change correctness and performance benefits.
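To illustrate the frequency suggestion (a hypothetical sketch, not code from this PR; the threshold of 1000 is only an example), gating compaction on the number of applied entries could look like:

```go
package main

import "fmt"

const compactEveryNApplies = 1000 // illustrative threshold, not a tuned value

// compactGate triggers raft log compaction at most once every
// compactEveryNApplies applied entries instead of after every apply.
type compactGate struct {
	lastCompactedAt uint64 // applied index at which we last compacted
}

// shouldCompact reports whether enough entries have been applied since the
// last compaction, and records the new position when it returns true.
func (g *compactGate) shouldCompact(appliedIndex uint64) bool {
	if appliedIndex < g.lastCompactedAt+compactEveryNApplies {
		return false
	}
	g.lastCompactedAt = appliedIndex
	return true
}

func main() {
	g := &compactGate{}
	for i := uint64(1); i <= 2500; i++ {
		if g.shouldCompact(i) {
			// The real server would call raftStorage.Compact(...) here.
			fmt.Println("compact at applied index", i) // fires at 1000 and 2000
		}
	}
}
```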
This is great advice. I'll reduce the compaction frequency to once every 100/1000 runs and observe the benchmark results. With such a low frequency, replacing the min heap with a simple slice should not impact performance and will improve readability and maintainability.
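For illustration only (hypothetical names, not the PR's actual implementation), the slice-based alternative to the min heap might look like the following: track the indexes of in-flight snapshots and never compact past the smallest one. With at most a handful of concurrent snapshots, the linear scan is effectively free.

```go
package main

import "fmt"

// snapshotTracker records the raft indexes of snapshots that are still being
// written, so log compaction never drops entries an ongoing snapshot needs.
type snapshotTracker struct {
	inflight []uint64 // indexes of ongoing snapshots; expected to stay tiny
}

func (t *snapshotTracker) add(index uint64) { t.inflight = append(t.inflight, index) }

func (t *snapshotTracker) remove(index uint64) {
	for i, v := range t.inflight {
		if v == index {
			t.inflight = append(t.inflight[:i], t.inflight[i+1:]...)
			return
		}
	}
}

// minInflight returns the smallest in-flight snapshot index; ok is false when
// no snapshot is in progress and compaction is only bounded by the applied index.
func (t *snapshotTracker) minInflight() (min uint64, ok bool) {
	for i, v := range t.inflight {
		if i == 0 || v < min {
			min = v
		}
	}
	return min, len(t.inflight) > 0
}

func main() {
	t := &snapshotTracker{}
	t.add(120)
	t.add(95)
	if m, ok := t.minInflight(); ok {
		fmt.Println("do not compact past index", m) // 95
	}
	t.remove(95)
}
```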
Thanks for the PR, but I think the complexity introduced in this PR outweighs the benefit. We should always consider simplifying the core apply workflow, so as to improve understandability and maintainability.

Previously, creating a snapshot was performed asynchronously in a separate goroutine (see the link below), because the v2 snapshot contained all key data, so it might be huge and accordingly could take a long time to create. But that isn't the case anymore; v2store & the v2 snapshot only contain membership data now.

etcd/server/etcdserver/server.go Line 2129 in 1b25193

So I think we can start with turning it into a sync operation, and a direct & immediate benefit is a big simplification. But we need to evaluate the impact on performance. Afterwards, we can discuss any further improvements step by step.
I wouldn't phrase it as "complexity outweighs benefits", however I agree that we should consider simplifying the code first. It's a good suggestion to start by making the operation synchronous. @clement2026 Could you prepare a separate PR that changes the snapshot + compact into a sync operation within the apply loop and see how this impacts performance? Removing GoAttach should be enough. Please let me know if you need any help.
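Roughly what the sync change means (a simplified, self-contained sketch with stand-in types; GoAttach here only mimics the idea of etcd's tracked goroutines, and the snapshot/compact methods are placeholders):

```go
package main

import (
	"fmt"
	"sync"
)

// server is a stand-in for the etcd server; GoAttach mimics running a task
// in a tracked goroutine the way etcd's (*EtcdServer).GoAttach does.
type server struct{ wg sync.WaitGroup }

func (s *server) GoAttach(f func()) {
	s.wg.Add(1)
	go func() { defer s.wg.Done(); f() }()
}

func (s *server) snapshot(appliedIndex uint64) { fmt.Println("snapshot at", appliedIndex) }
func (s *server) compact(appliedIndex uint64)  { fmt.Println("compact at", appliedIndex) }

func main() {
	s := &server{}

	// Before (simplified): snapshot + compact run in a separate goroutine and
	// can overlap with later applies.
	s.GoAttach(func() {
		s.snapshot(100)
		s.compact(100)
	})
	s.wg.Wait()

	// After (simplified): run them synchronously inside the apply loop, which
	// is acceptable now that the v2 snapshot only holds membership data.
	s.snapshot(200)
	s.compact(200)
}
```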
@ahrtr @serathius Wow, that sounds great! Converting snapshot and compact into sync operations should significantly simplify things. I'll start with the sync operation in a new PR and see how it goes. If everything works well, I'll rewrite this PR based on that one.
Out of curiosity, @clement2026, what parameters did you use to run your benchmarks? And how long does it take to run them?
@ivanvc Initially, I ran the script with the default parameters, but it didn't complete overnight. I then terminated the script and adjusted the parameters to ensure it would finish within a limited timeframe. With the modified parameters, the script completed in a few hours (I did not record the exact time):

```sh
export RATIO_LIST="4/1"
export REPEAT_COUNT=3
export RUN_COUNT=50000
./rw-benchmark.sh
```

I ran the same script twice last night to benchmark another PR, taking a total of 4 hours and 13 minutes (approximately 2 hours and 6 minutes per run). FYI: #18283 (comment)
I think this PR demonstrates the great potential benefits of this idea. Should the change be guarded by a feature gate to be safe?
@clement2026 With #18283 merged, what would be the best way to move forward? Is this PR still relevant? cc @ahrtr
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Based on #18283, we can implement this feature with less code and make it simpler. So, we don't need this PR anymore. I’ll create a new PR to keep things moving forward. Thanks, everyone, for taking the time to review it! Closing this.
I'm looking into #17098 @serathius and made changes to separate the compact logic from the snapshot logic.

I ran rw-benchmark.sh several times to observe performance changes. I expected this patch to slow down etcd, but it seems to have improved rw-benchmark.sh results. I’m not sure how. Similar results were observed on both cloud VMs (8 vCPU, 16 GB memory, NVMe) and a bare-metal machine (8 CPU, 32 GB memory, SSD).

Since the rw-benchmark.sh script was taking forever to finish, I adjusted the parameters to complete the benchmark within a few hours. Below are the script and the results from running it on a Vultr bare-metal machine (8 CPUs, 32 GB memory, SSD, Ubuntu 24.04 LTS x64).

main.csv
issue-17098-patch.csv